Skip to content

Merge dev into main for 5.0 release.#1779

Merged
tfenne merged 17 commits into
masterfrom
dev
May 1, 2026
Merged

Merge dev into main for 5.0 release.#1779
tfenne merged 17 commits into
masterfrom
dev

Conversation

@tfenne
Copy link
Copy Markdown
Member

@tfenne tfenne commented May 1, 2026

Description

Merge dev into main for 5.0 release. All code has previously been through PRs into dev.

tfenne and others added 17 commits April 25, 2026 06:12
#1758)

* Fix SamLocusIterator so that read position is not incorrectly offset after transiting insertions that fail the base quality check.
… improvement in performance (#1764)

* Optimize BAM nibble-to-ASCII base decoding with bulk lookup table.

Replace the per-nibble method call chain in SAMUtils.compressedBasesToBytes
with a 512-byte pre-computed lookup table (NIBBLE_PAIR_LOOKUP) that decodes
two bases per iteration. Ported from htslib's code2base approach.

The previous implementation made 4 method calls per base pair, each involving
a try/catch and array lookup. The new version does a single table lookup per
compressed byte with no method calls, halving loop iterations.

Benchmarks show ~7% speedup on pure BAM reading.

Remove dead methods compressedBaseToByte, compressedBaseToByteLow, and
compressedBaseToByteHigh which are no longer called. Add tests for odd-length
sequences, single base, empty sequence, offset, and round-trip encoding.

* Eliminate redundant reference name/index resolution during BAM decoding.

BAMRecord construction was performing ~10 dictionary lookups per record
(index->name->index round-trips for both ref and mate, then again via
a redundant setHeader() call in BAMRecordCodec.decode()). This reduces
it to 2 lookups per record (one resolveNameFromIndex each for ref and
mate) by:

- Adding package-private setReferenceNameAndIndex/setMateReferenceNameAndIndex
  methods to SAMRecord that set both fields directly
- Using these in the BAMRecord constructor instead of setReferenceIndex/
  setMateReferenceIndex which round-trip through name resolution
- Removing the redundant setHeader(header) call in BAMRecordCodec.decode()
  since the header is already set in the BAMRecord constructor

Profiling showed reference resolution at ~8% of CPU samples during BAM
reading. Benchmarking on a 12.3GB BAM (207M records) confirmed a ~7-8%
wall-clock improvement (104.2s -> 96.2s).

* Optimize BAM tag decoding: single-pass string reads and skip redundant validation.

Two targeted optimizations to the BAM tag decoding path:

1. BinaryTagCodec.readNullTerminatedString: replace the double-pass
   approach (scan forward for null, reset, re-read) with a single-pass
   scan over the backing byte array, eliminating the intermediate byte[]
   allocation per string tag.

2. SAMBinaryTagAndValue/SAMBinaryTagAndUnsignedArrayValue: add
   package-private constructors that skip isAllowedAttributeValue()
   validation, used from BinaryTagCodec.readTags() where the value
   types are known to be valid from the BAM type codes.

Profiling with eager decode showed tag decoding at 18.2% of CPU.
Benchmarking on a 3.9GB BAM (52M records) showed a ~3% improvement
in the eager-decode path (40.0s -> 38.9s) with no regression in the
lazy path.
#1768)

* Integrate jlibdeflate for faster DEFLATE compression and decompression

Add optional jlibdeflate (libdeflate JNI bindings) support that is automatically
used when the native library is available on the classpath, with transparent
fallback to the JDK Inflater/Deflater.

- DeflaterFactory/InflaterFactory auto-detect jlibdeflate availability at
  first use, caching the result for the lifetime of the JVM
- LibdeflateDeflater/LibdeflateInflater extend JDK Deflater/Inflater and
  correctly handle both raw DEFLATE (nowrap=true) and zlib format (nowrap=false)
- New Defaults.USE_LIBDEFLATE (samjdk.use_libdeflate) defaults to true;
  set to false to force JDK-only compression
- jlibdeflate 0.1.0-SNAPSHOT added as a dependency from Sonatype Central snapshots
…build (#1769)

* Update maven central publishing to the new central portal and document release publication steps in CONTRIBUTING.md

* Upgrade Gradle to 9.4.1, and latest release version of all plugins; fix build deprecations
…1753)

Add UnsatisfiedLinkError to the catch clause in tryToLoadSnappy()
to handle cases where native library loading fails (e.g., musl/glibc
detection issues in snappy-java 1.1.10.8+).
Initial implementation of CRAM 3.1 write support, contributed by Chris
Norman. Adds a naive write profile that writes valid CRAM 3.1 files
without yet using any of the format's new specialised codecs (FQZComp,
Name Tokeniser, RangeCoder, etc.); all data series continue to use
CRAM 3.0 codecs (rANS, GZIP). This squashes the four prior commits:

  - Enable a naive CRAM 3.1 write profile.
  - Changes based on instrumented trial runs.
  - CRAM 3.1 write tests and code cleanup.
  - Temp fix for preSorted=false default.

This commit serves as the foundation that the following commit builds
on to add the full CRAM 3.1 codec set, profile system, and performance
optimisations.

Co-authored-by: Tim Fennell <tim@fulcrumgenomics.com>
Builds out CRAM 3.1 write support on top of cnorman's initial naive
profile, taking htsjdk to feature parity with samtools/htslib for
writing CRAM 3.1 and matching samtools on cross-implementation fidelity
across a broad range of real-world data.

CRAM 3.1 codecs and supporting infrastructure
---------------------------------------------
- Full CRAM 3.1 codec set: rANS Nx16 (Order 0/1, RLE, Stripe, Pack,
  Cat), FQZComp (quality scores), Range adaptive arithmetic coder, and
  the Name Tokeniser.
- New compression profile system (FAST, NORMAL, SMALL, ARCHIVE),
  matching htslib's `--output-fmt-option fast/small/archive`. Each
  profile picks per-DataSeries compressors and (for SMALL/ARCHIVE)
  trial-compression candidate sets.
- TrialCompressor: a wrapper that tries multiple compressors per block
  and keeps the smallest output. Replaces the ad-hoc tag triple-
  compression path with a uniform, profile-driven mechanism.
- DataSeries content IDs aligned with htslib so CRAM dumps from the
  two implementations are directly comparable.

Codec performance work
----------------------
- rANS codecs reworked to use a `byte[]` API and backwards-write,
  removing per-byte stream overhead and several layers of indirection.
- GzipCodec uses Deflater/Inflater directly instead of going through
  GZIPInputStream/OutputStream, avoiding gzip framing overhead per
  block.
- Name Tokeniser encoder: hand-written tokenisation replaces regex,
  per-type flags + STRIPE + duplicate-stream elimination + all-MATCH
  fast path significantly improve both speed and ratio.
- FQZComp / Range coder / rANS encoder hot paths tightened.
- Pooled RANSNx16Decode instance in the Name Tokeniser to avoid
  reallocating per call.
- Replaced ByteArrayInputStream/OutputStream with unsynchronized
  CRAMByteReader/Writer to remove monitor overhead in tight loops.
- Cached SAMTag key metadata to eliminate per-record String allocation
  during CRAM decode.
- Fused read-base restoration, CIGAR building, and NM/MD computation
  into a single pass instead of three.
- jlibdeflate compatibility fix on the ByteBuffer path.
- Roughly 15% faster encoding overall vs the pre-optimisation state,
  with read decoding gains in line.

Correctness fixes found during cross-implementation validation
--------------------------------------------------------------
- TLEN is now computed using htslib's coordinate-extent rule
  (max(end1,end2) - min(start1,start2) + 1, signed by leftmost
  position) for CRAM-specific compute, rather than the SAM 5'-to-5'
  rule. Without this, every read with a supplementary alignment
  mismatched on TLEN through any CRAM round-trip.
- CompressionHeader serialisation now uses a growable
  ByteArrayOutputStream rather than a fixed 100 KB ByteBuffer for the
  preservation map and tag encoding map. The TD tag dictionary alone
  can exceed 100 KB for rich tag sets (PacBio/Ultima flow-space, ONT
  modified-base tags) under the archive profile, where it would
  previously throw BufferOverflowException.
- Slice flush now uses a dual record/bases threshold matching htslib's
  `seqs_per_slice` AND `bases_per_slice` (default = readsPerSlice *
  500). Without this, archive-profile encoding of long-read data
  (PacBio HiFi 15 kb, ONT 30 kb+) would buffer ~1+ GB of quality
  scores per slice and OOM the FQZComp encoder.
- Strip NM/MD on encode and regenerate them on decode (matching
  htslib's default `store_nm=0`/`store_md=0` behaviour). Implemented
  attached mate pairs to align the in-memory representation with the
  spec.
- Restored CIGAR reconstruction when SEQ is `*` (CF_UNKNOWN_BASES).
- Fixed crash when reading containers that contain no slices.
- Removed unnecessary content digest tags from CRAM slice headers
  (htslib doesn't write them either; htsjdk's verification on read
  was overly strict).

Tools
-----
- `CramConverter`: a small command-line tool for converting between
  SAM/BAM/CRAM, primarily for benchmarking and exercising profiles.
- `CramStats`: dump per-block compression statistics from a CRAM file
  for debugging and ratio analysis.

Tests and CI
------------
- New hts-specs CRAM 3.0/3.1 compliance tests covering decode, index
  query, and round-trip behaviour.
- FQZComp round-trip tests using the hts-specs quality data files.
- CRAI index query correctness tests; codec roundtrip property tests.
- Split CRAM31 fidelity tests into per-profile classes for parallel
  execution.
- Reduced memory pressure in the unit tests to eliminate intermittent
  OOM failures on CI.
- Sped up several long-running CRAM tests by caching test data,
  reducing slice-size matrices, and downsampling fixtures.
- Misc: fixed a thread-safety bug in VariantContextTestProvider that
  was producing non-deterministic test counts.

CHANGELOG / README updated for the 5.0.0 release.

Co-authored-by: Chris Norman <cnorman@broadinstitute.org>
Snakemake pipeline that round-trips a curated set of public BAM/CRAM
files through htsjdk and samtools and verifies output equivalence
across all four CRAM 3.1 compression profiles (FAST, NORMAL, SMALL,
ARCHIVE) and all four encode/decode combinations. The pipeline was
used to validate the CRAM 3.1 write implementation in the previous
commit against samtools 1.21 on real-world data spanning five
sequencing platforms (Illumina, PacBio HiFi, ONT, Element AVITI,
Ultima), seven assay types (WGS, WGBS, RNA-seq, scRNA-seq, exome,
Hi-C, amplicon), and two organisms.

Pipeline (validation/)
----------------------
- Snakefile: encode original -> CRAM with both htsjdk and samtools,
  decode each CRAM with both, then compare each decoded BAM against
  the original. 4 profiles * 4 encoder/decoder combinations = 16
  comparisons per sample.
- Two sample-sheet formats:
    samples.tsv      - local file paths (3 columns)
    test_datasets.tsv - remote URLs auto-downloaded (9 columns)
  Format is auto-detected from header. Reference URLs may be a
  comma-separated list, with optional `#contig_name` suffix to rename
  a single-sequence FASTA's header on download (e.g. Lambda phage
  chrL spike-in for bisulfite samples).
- Intermediate files (BAMs and CRAMs) are marked temp() so they are
  cleaned up as soon as their dependents finish; rule priorities push
  the DAG to drain depth-first, keeping peak disk bounded.
- Per-rule log directives, retry/escalating-memory ladder for the
  Java jobs (4 GB -> 6 GB -> 8 GB -> 10 GB), and pixi.toml for a
  reproducible toolchain.

CramComparison
--------------
A general-purpose record-by-record SAM/BAM/CRAM comparison utility
used by the pipeline as the comparator. Tolerates the universal CRAM
roundtrip normalisations:
  - CIGAR =/X collapsed to M before comparison (CRAM stores match/
    mismatch in read features, so the =/X distinction is lost on
    decode in both htsjdk and samtools).
  - mapQ ignored for unmapped reads (SAM spec leaves it undefined,
    and both implementations normalise it to 0 on decode).
  - Auto-generated MD/NM tags stripped when one side is missing them.
  - Unsigned B-array type differences tolerated (CRAM stores signed).
Exit code 0 on both match and mismatch; exit 2 only on actual error,
so Snakemake preserves the result file in either case.

Co-authored-by: Claude <noreply@anthropic.com>
…ields (#1762)

* Changed SAMRecord.toString() to emit the SAM format string with all fields.
* Drop sync + trailing newline from SAMRecord.getSAMString.
* Remove deprecated SAMRecord.format() and now-dead helpers.

The text formatting path used by SAMRecord.toString() / getSAMString() has been
cleaned up:

- SAMTextWriter.getSAMString no longer uses a JVM-wide synchronized static cache
  of a shared StringWriter/SAMTextWriter. It now allocates a fresh StringWriter
  per call. Without this change, every toString() call would have taken a global
  monitor.
- SAMTextWriter.writeAlignment is split into a private writeAlignmentNoNewline
  plus a thin wrapper that appends '\n'. getSAMString uses the no-newline
  variant, so the result no longer needs to be trimmed.

BREAKING CHANGE: SAMRecord.getSAMString() no longer terminates the returned
String with a newline character. This brings SAMRecord into line with the other
getSAMString() implementations (SAMSequenceRecord, SAMReadGroupRecord,
SAMProgramRecord, SAMFileHeader), which already return newline-free strings.
Callers that relied on the trailing '\n' as a separator (e.g. concatenating two
records' SAM strings) must now insert their own separator. Callers that stripped
the newline with .trim() can drop the call.

BREAKING CHANGE: SAMRecord.format() has been removed. Callers should use
SAMRecord.getSAMString() instead.
* Make snapshot version names reflect the next planned release.

Previously the build appended "-SNAPSHOT" to git-describe output, so all
snapshots between releases were named after the *previous* release (e.g.
"4.3.0-3-gabcdef0-SNAPSHOT"), the opposite of Maven convention where SNAPSHOTs
are previews of the next release.

The fix: declare a `nextVersionBump` ("x" / "x.x" / "x.x.x") at the top of
build.gradle, and compute the next semver from the most recent release tag
plus that bump. Snapshots are then named "<nextVersion>-<shortHash>-SNAPSHOT"
(e.g. "5.0.0-23c681a8dc-SNAPSHOT"); including the hash makes each snapshot a
distinct, pinnable artifact.

Release builds (-Drelease=true) require HEAD to be on a semver tag and the
working tree to be clean; the tag is the version (nextVersionBump is ignored
on release — the tag is authoritative).

nextVersionBump is set to "x" since the next release will be a major one.

* Remove dead Ant build and Travis CI config.
This is a mechanical reformat of every .java file under src/ via the
Palantir Java Format style. There are no behavior changes.

The follow-up commit adds the Spotless Gradle plugin that produced this
output, wires it into the build as an enforcement step, updates the
contributor docs, deletes the obsolete java-style-eclipse.xml and
java-style-intellij.xml style guides, and adds .git-blame-ignore-revs
so this commit is skipped by `git blame`.

The single non-mechanical change in this commit is the removal of one
stray "// the imports for unit testing." comment from
VariantContextUnitTest.java that was preventing the formatter from
treating the surrounding imports as a contiguous block. The comment
carried no information.

Part of #1761.
…1761)

All build, CI, and contributor-facing changes for the new code-style
enforcement. The bulk reformat itself is in the previous commit; this
commit only contains setup.

build.gradle:
- Add the com.diffplug.spotless plugin (8.4.0), configured with
  palantirJavaFormat() over src/**/*.java. There are no formatter
  knobs -- the formatter is the style guide.
- Make compileJava depend on spotlessJavaApply so any unformatted
  source is auto-rewritten in place during the build. Contributors
  don't have to remember to run a format command -- just build, and
  the source ends up formatted. spotlessJavaApply is sub-second warm
  thanks to Gradle's up-to-date checking, so the cost on every build
  is negligible.

.github/workflows/tests.yml:
- Add a `formatCheck` CI job running `./gradlew spotlessCheck`
  (verify-only, no mutation) so unformatted code can't slip past CI.
  The local auto-format is a convenience; CI is the enforcement
  boundary.

CONTRIBUTING.md:
- New "Code Style" section: how the auto-format works, when/why you
  might still want to run `spotlessApply` directly, that CI verifies
  rather than rewrites, and how to opt in to .git-blame-ignore-revs
  locally.

.git-blame-ignore-revs:
- New file pointing at the bulk-format commit so `git blame` (and the
  GitHub web blame view, which honors the file automatically) credits
  the original author of each line rather than the reformat commit.

Deletions:
- java-style-eclipse.xml and java-style-intellij.xml -- stale,
  unreferenced by the build, and inconsistent with Palantir's output.

Closes #1761.
The NCBI SRA reader, backed by the gov.nih.nlm.ncbi:ngs-java native
library, has been a recurring source of build/runtime friction and is
being removed for v5.0.0. Downstream consumers that need SRA must
implement it themselves or use a different library.

Removed:
- gov.nih.nlm.ncbi:ngs-java api dependency from build.gradle.
- The "sra" TestNG group from build.gradle's test and testExternalApis
  tasks (and from htsjdk.TestDataProviders.EXCLUDED_GROUPS).
- The -Dsamjdk.sra_libraries_download=true JVM arg from
  testExternalApis. The corresponding Defaults.SRA_LIBRARIES_DOWNLOAD
  field is removed.
- 10 main sources under htsjdk/samtools and htsjdk/samtools/sra/
  (SRAFileReader, SRAIterator, SRAIndex, SRAAccession, SRALazyRecord,
  SRAAlignmentIterator, SRAUnalignmentIterator, SRAUtils, ReferenceCache,
  SRAIndexedSequenceFile).
- 7 test classes under htsjdk/samtools/sra/ and the test_archive.sra
  resource.
- SamReader.Type.SRA_TYPE.
- The README's SRA license-attribution sentence.

Modified (SRA hooks removed):
- SamInputResource: drops InputResource.Type.SRA_ACCESSION, the
  abstract asSRAAccession() method and all its overrides, the
  SRAInputResource inner class, and the SamInputResource.of(SRAAccession)
  factory.
- SamReaderFactory: drops the SRA dispatch in the resource type switch,
  the isSra() autodetection branch, the abstract
  applyTo(SRAFileReader, ...) method on Option and all five overrides
  of it, plus the instanceof SRAFileReader branch.
- SamReaderFactoryTest, SamReaderTest, ReadsBundle (TODO comment): minor
  follow-on edits where these referenced the removed types.

Verification:
- ./gradlew compileJava compileTestJava: passes.
- ./gradlew test: 21,877 / 21,877 pass (down from 21,936 on dev; the
  difference is the removed sra test classes plus DataProvider-discovered
  test paths that referenced the sra group).
- ./gradlew spotbugsMain spotbugsTest: clean, no new findings.
- ./gradlew spotlessCheck: clean (autoformat applied during compile).
- Stragglers: grep'd src/, build.gradle, README, CONTRIBUTING, CHANGELOG,
  and .github/ for "SRA" tokens, "import ngs.", "sra" string literals,
  "ngs-java", and "sra_libraries" -- no remaining references.

CHANGELOG entry deferred to the dedicated v5.0.0 CHANGELOG task.
)

Slim htsjdk's published runtime classpath by making the JavaScript
engine an opt-in dependency, and clean up around the change. Net effect
for consumers of the v5.0.0 artifact: 6 fewer jars on the runtime
classpath (nashorn-core + 5 ASM transitives, ~2.5 MB) and one less
"misleading direct dependency" in the published POM.

** BREAKING CHANGE **

Consumers using htsjdk.samtools.filter.JavascriptSamRecordFilter or
htsjdk.variant.variantcontext.filter.JavascriptVariantFilter must now
add a JSR-223 "js" engine to their own runtime classpath. The
recommended choice is OpenJDK Nashorn:

    Gradle:  runtimeOnly 'org.openjdk.nashorn:nashorn-core:15.7'

    Maven:   <dependency>
               <groupId>org.openjdk.nashorn</groupId>
               <artifactId>nashorn-core</artifactId>
               <version>15.7</version>
               <scope>runtime</scope>
             </dependency>

If no engine is on the classpath at runtime, AbstractJavascriptFilter's
constructor now throws a RuntimeScriptException whose message names the
calling filter class and prints both Gradle and Maven coordinates, so
the failure is fully self-describing.

The breaking change will be called out in the dedicated v5.0.0
CHANGELOG / breaking-changes PR alongside the other v5.0.0 changes.
Expands the existing 5.0.0 stub (which previously covered only the CRAM
3.1 write work) into a complete entry covering everything since 4.3.0.

Adds:

- Lead headlines summarizing the major themes (CRAM 3.1 writing, slimmer
  runtime deps, faster BAM [de]compression, enforced formatting, fixed
  test reporting).
- A prominent ⚠️ Breaking changes section calling out SRA removal,
  Nashorn now opt-in, the SAMRecord.toString() format change, the
  removed CRAM slice digest tags, and the new default CRAM version 3.1.
- A new "CRAM correctness and cross-implementation fixes" section
  consolidating the read- and write-path fixes that improve interop
  with samtools/htslib (TLEN computation, CIGAR =/X comparison, CIGAR
  reconstruction for sequence '*', container-with-no-slices crash,
  archive header overflow, unmapped-read query, supplementary/secondary
  read-name limitation).
- Performance entries beyond just the CRAM-internal optimizations:
  jlibdeflate integration, the BAM decoding path improvements, and a
  long-read-friendly bases-per-slice threshold.
- A bug-fix section covering the LTF8 9-byte write fix, the
  SamLocusIterator offset bug, the SamPairUtil dovetail fix, and the
  snappy native-load UnsatisfiedLinkError catch.
- A build, tooling, and dependency clean-up section: Palantir Java
  Format + Spotless enforcement, Maven Central portal migration,
  snapshot version naming, deprecation cleanup, the test-runner
  pass/fail-reporting fix, and the dependency clean-up (commons-logging
  constraint, Nashorn compileOnly, ngs-java removal).
- A compatibility line noting JDK 17 / 21 / 24 spot-checks.
- Expanded testing entries: the hts-specs CRAM 3.0/3.1 compliance
  tests, FQZComp round-trip tests, CRAI correctness tests, test-suite
  speedups, the CEUTrio test-data downsizing, and the JS filter test
  bulk-up.

Source for the additions was the full git log since the 4.3.0 tag plus
the unsquashed backup branch tf_cram_31_backup_20260425, which retains
fine-grained commits that the merged CRAM 3.1 PR squashed away.

The CRAM write-speed gains are intentionally not headlined yet --
prior htsjdk wrote CRAM 3.0 (lower compression, fewer codec passes), so
"faster" without "and same/better compression" would be misleading.
We'll revisit the perf bullet after benchmarking against samtools.
Cleans up 100 javadoc warnings flagged by the javadoc target. Fixes fall
into a few well-defined categories: unescaped angle brackets and
ampersands in doc text, malformed @link references (missing braces,
invalid this#, double-# chains), an unresolved package reference, an
unclosed @code tag, @return tags on void methods and constructors, and a
duplicate @param. Doc-only changes; no behavioral impact.
@tfenne tfenne merged commit fb57878 into master May 1, 2026
3 of 4 checks passed
@tfenne tfenne deleted the dev branch May 1, 2026 19:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants